Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocs are healthy if service checks get healthy before task health #7944

Merged
merged 2 commits into from
May 13, 2020

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented May 13, 2020

This fixes a bug where an alloc is considered unhealthy if the alloc service checks pass before the tasks start up. This may occur in a case where a task takes a relatively long time to start (e.g. large image to download).

The bug was due to the check health detector loop wouldn't propagate the checks health again. This change here, ensures that we keep checking checks health until the final alloc health outcome is determined.

I've added a failing test first, so you can see the failure in https://circleci.com/gh/hashicorp/nomad/65986 . The test passes with the last commit.

Mahmood Ali added 2 commits May 13, 2020 07:43
Add a failing tests to show that if an alloc checks is marked healthy before the
alloc tasks start up, the alloc may be forever considered unhealthy.
Fix a bug where if the alloc check becomes healthy before the task health, the
alloc may never be considered healthy.
@notnoop notnoop requested review from lgfa29 and tgross May 13, 2020 12:10
@notnoop notnoop self-assigned this May 13, 2020
t.setCheckHealth(true)
if t.setCheckHealth(true) {
// final health set and propagated
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for this function says this is a "long lived goroutine". If we return here, how does the tracker detect service check failures that happen later on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"long lived goroutine" is a bit misnomer - it runs until the alloc is marked healthy or terminally unhealthy. Once an alloc is marked healthy, it never marks it unhealthy.

The exiting goroutine happens today already. setCheckHealth and setTaskHealth call t.cancelFn() which will cause both watchTaskEvents and watchConsulEvents as they handle t.ctx.Done().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@notnoop notnoop merged commit 3cb5551 into master May 13, 2020
@notnoop notnoop deleted the b-health-checks-after-task-health branch May 13, 2020 13:34
@github-actions
Copy link

github-actions bot commented Jan 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants